Add SyntheticDiD variance_method='bootstrap_refit' and coverage MC study (#351)
Implements Arkhangelsky et al. (2021) Algorithm 2 step 2 as an opt-in
variance method that re-estimates ω̂_b and λ̂_b via two-pass sparsified
Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-
scale zeta. Default remains "placebo".
Cross-surface allow-list extensions land in one PR per
feedback_cross_surface_parity_audit.md:
- SyntheticDiD.fit() dispatcher and _bootstrap_se signature
- synthetic_did.py:695 n_bootstrap result population
- results.py:960 summary() "Bootstrap replications" gating
- business_report.py:602 inference-label allow-list
- power.py SDID guidance strings (2 sites)
- SyntheticDiD.__init__ docstring and diff_diff/guides/llms-full.txt
Survey + bootstrap_refit raises NotImplementedError upstream in fit()
(covers both pweight-only and full-design) — the Rao-Wu rescaled-weight
composition is tracked as a follow-up TODO.
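A minimal sketch of that guard's shape (the function name and signature here are hypothetical; the real check lives inside ``SyntheticDiD.fit()``):

```python
def check_refit_survey_support(variance_method, survey_design):
    """Reject survey designs under bootstrap_refit before any estimation work.

    Hypothetical stand-in for the fit()-time guard: it covers both
    pweight-only and full strata/PSU/FPC designs, since the Rao-Wu
    rescaled-weight composition is deferred to a follow-up.
    """
    if survey_design is not None and variance_method == "bootstrap_refit":
        raise NotImplementedError(
            "variance_method='bootstrap_refit' does not yet support survey "
            "designs; use 'placebo' or 'jackknife', or drop the design."
        )
```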
Coverage MC study (benchmarks/python/coverage_sdid.py) runs 500 seeds ×
B=200 × 3 DGPs × 4 methods under H0 and writes
benchmarks/data/sdid_coverage.json (4.4 KB). Rejection rates at α ∈
{0.01, 0.05, 0.10} and mean SE / true SD ratios are transcribed into
REGISTRY.md §SyntheticDiD. Headline: refit achieves near-nominal
calibration across all 3 DGPs; fixed-weight over-rejects by roughly
1.8–3.2× on smaller panels, consistent with the SE under-estimate from
ignoring weight-estimation uncertainty.
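The per-method transcription step can be sketched as follows (illustrative only; the helper name and return shape are assumptions, not the benchmark script's actual code):

```python
import numpy as np

def summarize_coverage(p_values, se_hats, att_hats, alphas=(0.01, 0.05, 0.10)):
    """Per-method summary for an H0 Monte Carlo: rejection rate at each alpha
    plus the mean-SE / true-SD ratio (a ratio below 1 signals SE
    under-estimation, which shows up as over-rejection)."""
    p = np.asarray(p_values, dtype=float)
    rejection = {a: float(np.mean(p < a)) for a in alphas}
    se_ratio = float(np.mean(se_hats) / np.std(att_hats, ddof=1))
    return rejection, se_ratio
```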
Tests: TestBootstrapRefitSE (8 tests) + test_refit_p_value_matches_analytical
in TestPValueSemantics + TestCoverageMCArtifact schema smoke test
(guarded with pytest.skip per feedback_golden_file_pytest_skip.md) +
cross-surface BR inference-label test. PR #349's 1e-10 R-parity
bit-identity gate still passes.
Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside
the refit loop and aggregated into a single summary warning at end-of-
loop if the rate exceeds 5% — the same threshold the retry-exhaustion
guard uses.
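The suppress-then-aggregate shape described above can be sketched as follows (a simplified illustration with hypothetical names, not the library's actual loop):

```python
import warnings

def run_refit_loop(draws, refit_one, threshold=0.05):
    """Run a per-draw refit, silencing individual Frank-Wolfe warnings and
    emitting one summary UserWarning if the per-draw non-convergence rate
    exceeds the threshold."""
    estimates, n_bad = [], 0
    for draw in draws:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", UserWarning)
            estimate, converged = refit_one(draw)
        estimates.append(estimate)
        n_bad += 0 if converged else 1  # count draws, not raw solver warnings
    rate = n_bad / max(len(draws), 1)
    if rate > threshold:
        warnings.warn(
            f"Frank-Wolfe did not converge on {rate:.0%} of bootstrap draws",
            UserWarning,
        )
    return estimates
```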
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI review caught that the sklearn-style setter path bypassed the
constructor's enum/coherence checks, so users could call
``set_params(variance_method='not_a_method')`` after construction and
slip past the __init__ validation added for ``bootstrap_refit``. Extract
the existing checks into a private ``_validate_config()`` helper called
from both ``__init__`` and ``set_params`` so both paths enforce the same
contract. Hoist the valid-methods tuple onto the class as
``_VALID_VARIANCE_METHODS`` so __init__ and the validator share a single
source of truth.
Add regression tests under ``TestGetSetParams``:
- set_params accepts ``bootstrap_refit``
- set_params rejects unknown variance_method (parity with __init__)
- set_params rejects incoherent n_bootstrap < 2 when method != jackknife
- set_params allows n_bootstrap=1 under jackknife (deterministic)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the review's P2 finding: if ``_validate_config`` rejects
the post-update state in a multi-attribute ``set_params`` call, the
instance was left with partially-applied (invalid) values after the
raised ``ValueError``. Snapshot the original values before any setattr
and restore them in an except handler, so the raise leaves the object
consistent with its pre-call configuration. A regression test asserts
the post-raise state matches the pre-call state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Overall Assessment: ✅ Looks good

Methodology: No unmitigated findings. The refit path in
diff_diff/synthetic_did.py:L583 and diff_diff/synthetic_did.py:L1071
re-estimates both weights per draw, and the non-placebo inference
dispatch in diff_diff/synthetic_did.py:L642 matches the registry notes
in docs/methodology/REGISTRY.md:L1505 and
docs/methodology/REGISTRY.md:L1555.

Code Quality, Performance, Security: No findings. Maintainability: No
unmitigated findings.

Execution note: the reviewer could not run the test suite in its
sandbox.
Three P3 items from the CI AI review, all under Documentation/Tests and
Code Quality (no methodology change):
1. **FW warning counting**: the refit loop counted raw solver warnings
   (up to 3 per draw: ω pre-sparsify, ω main, λ main), not draws. The
   REGISTRY text describes the rate per valid bootstrap draw, so the two
   could diverge by up to 3×. Count "any non-convergence on this draw"
   as a boolean 1, and tighten the REGISTRY wording to call out the
   per-draw convention explicitly (each draw runs Frank-Wolfe once for ω
   and once for λ; either can trip the draw).
2. **Docstring gaps**: `fit()` now documents the new survey + refit
   NotImplementedError under the `survey_design` parameter and in the
   Raises block. The `SyntheticDiDResults.variance_method` and
   `SyntheticDiDResults.placebo_effects` field docs now list
   `bootstrap_refit` and describe the four-way semantics of
   `placebo_effects` under each variance method.
3. **Scale-equivariance regression**: add `"bootstrap_refit"` to the
   `variance_method` parametrizations in `TestScaleEquivariance`
   (baseline parity, scale equivariance across ~15 orders of magnitude,
   and extreme-scale detection). Capture a bit-identity baseline for the
   refit path on `_make_panel(seed=42)` so the normalization-sensitive
   branch has its own drift guard rather than one inferred from sibling
   methods.
All 37 SDID targeted tests + 361 broader sweep tests pass; the
fixed-weight 1e-10 R-parity bit-identity gate is preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

Code Quality, Performance, Maintainability, Security: No findings.
Documentation/Tests: No separate unmitigated findings beyond the
methodology-labeling issue above; the prior doc/test gaps from the last
review appear addressed in the changed files.
Tracing R's source (vcov.R::bootstrap_sample and synthdid.R) shows that
R's default synthdid::vcov(method="bootstrap") rebinds
attr(estimate, "opts") — which includes update.omega=TRUE from the
original fit — back into synthdid_estimate inside its do.call, so the
renormalized ω is used only as Frank-Wolfe initialization and ω and λ
are re-estimated per draw. R's default bootstrap is refit, not
fixed-weight. The sum_normalize helper in R's source explicitly comments
that the supplied weights "are used only for initialization" in
bootstrap and placebo SEs.

Our variance_method="bootstrap" holds the renormalized ω exactly (no FW
re-run); it is therefore a deliberate deviation from R's default. Our
PR #349 fixture generator at benchmarks/R/... is a manual fixed-weight
invocation — it omits the opts rebind, which defaults update.omega to
FALSE given non-null weights. The 1e-10 parity test anchors our
fixed-weight path to that manual R invocation, not to R's real vcov
behavior.

Documentation-only fix across all claim sites; no methodology or code
behavior changes:
- REGISTRY.md §SyntheticDiD: label the fixed-weight bootstrap as
  "Alternative: Bootstrap at unit level — fixed-weight shortcut"; add an
  explicit **Note (deviation from R)** citing the vcov.R / synthdid.R
  opts-rebind mechanism; call out bootstrap_refit as matching R's
  default vcov. Requirements-checklist entries and the R-parity test
  scope Note are rewritten to match.
- diff_diff/synthetic_did.py: the __init__ docstring and _bootstrap_se
  method docstring drop the "matching R" framing on the fixed-weight
  path; bootstrap_refit is flagged as matching R's default.
- diff_diff/results.py: the SyntheticDiDResults.variance_method field
  doc is fixed (the "R-compatible fixed-weight shortcut" misphrasing
  introduced in round 1 was wrong).
- CHANGELOG.md Unreleased/Added: the Bundle A entry clarifies that
  bootstrap_refit matches R's default and that the existing fixed-weight
  bootstrap is now explicitly documented as a deviation.
- benchmarks/R/generate_sdid_bootstrap_parity_fixture.R: a loop comment
  calls out the non-default invocation shape (no opts rebind → runs
  fixed-weight) and references the Python test that consumes this
  fixture.
- tests/test_methodology_sdid.py::test_bootstrap_se_matches_r docstring:
  rewritten to scope the parity check correctly (manual R fixed-weight,
  not R's default vcov).
- TODO.md: add a new row for the refit cross-language parity anchor
  (Julia Synthdid.jl or R via the real vcov path) to make the missing
  anchor explicit.

All 57 targeted tests pass; no methodology change, no numerical output
change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
variance_method="bootstrap" now means refit (Arkhangelsky et al. 2021
Algorithm 2 step 2; also R's default synthdid::vcov(method="bootstrap")
behavior, which rebinds attr(estimate, "opts") with update.omega=TRUE so
the renormalized ω serves only as Frank-Wolfe initialization). The
previously-shipped fixed-weight shortcut is removed entirely; the
"bootstrap_refit" enum value briefly added in earlier commits of this PR
is folded back into "bootstrap".

Why this is a correctness fix, not just a relabel: the old fixed-weight
"bootstrap" matched neither the paper (which prescribes refit) nor R's
default vcov (also refit). The 1e-10 R-parity test from PR #349 anchored
fixed-weight Python against a manual R invocation that omitted the opts
rebind — both sides were wrong in the same direction. Coverage MC at
benchmarks/data/sdid_coverage.json (500 seeds × B=200) confirms the new
"bootstrap" tracks placebo near-nominal across the three representative
DGPs; the old fixed-weight column over-rejected at α=0.05 at rates
0.16 / 0.098 / 0.092 (1.8-3.2× nominal).

Capability regression: any SDID + survey-design combination
(pweight-only AND strata/PSU/FPC) now raises NotImplementedError under
bootstrap. The removed fixed-weight bootstrap was the only SDID variance
method that supported strata/PSU/FPC (via the Rao-Wu rescaled bootstrap
branch inside _bootstrap_se). Pweight-only users can switch to
variance_method="placebo" or "jackknife"; strata/PSU/FPC users have no
SDID variance option in this release. Composing Rao-Wu rescaled weights
with paper-faithful Frank-Wolfe re-estimation needs a weighted-FW
derivation; a sketch and reusable scaffolding pointers live in
REGISTRY.md §SyntheticDiD's "Note (deferred survey + bootstrap
composition)" and TODO.md. The deleted Rao-Wu code (≈48 lines of
_bootstrap_se) is recoverable via
`git show <THIS_COMMIT>^:diff_diff/synthetic_did.py` near the
pre-rewrite _bootstrap_se body.

Cross-surface allow-list reverts: the additive "bootstrap_refit" enum
shipped in earlier commits of this PR rippled through results.py:960
summary gating, the business_report.py:602 inference-label allow-list,
power.py SDID guidance strings, llms-full.txt enums, and
SyntheticDiDResults field docstrings. All of those are now back to a
3-value surface ("bootstrap", "jackknife", "placebo").

Tests:
- TestBootstrapRefitSE class deleted; its 4 unique tests folded into
  TestBootstrapSE (tracks-placebo-exchangeable, raises-pweight-survey,
  raises-full-design-survey, summary-shows-replications).
- test_bootstrap_se_matches_r deleted along with its fixture
  (tests/data/sdid_bootstrap_indices_r.json) and generator
  (benchmarks/R/generate_sdid_bootstrap_parity_fixture.R) — they
  anchored the now-removed fixed-weight path.
- TestPValueSemantics::test_refit_p_value_matches_analytical deleted as
  a duplicate of test_bootstrap_p_value_matches_analytical.
- TestScaleEquivariance._BASELINE: the "bootstrap" row is updated to the
  refit values (4.6033, 0.21424970..., 2.10890881e-102, 200) —
  bit-identical to the captured "bootstrap_refit" baseline, since the
  new bootstrap path is the same code as the old refit path. Tolerance
  tightened from rel=1e-8 to rel=1e-14 to enforce bit-identity.
- TestGetSetParams: variance_method literals rebound to "bootstrap";
  test_set_params_accepts_bootstrap_refit deleted (redundant with
  constructor tests).
- TestCoverageMCArtifact: the expected methods list is set exact-equal
  to ("placebo", "bootstrap", "jackknife").
- test_business_report.py inference-label test class + method renamed to
  drop the "refit" suffix; the assertion checks for "bootstrap
  variance".

The benchmarks/data/sdid_coverage.json artifact is updated
transitionally in this commit (fixed-weight column dropped; refit column
renamed to bootstrap) so the schema test stays green; a follow-up commit
regenerates it from a fresh 500-seed MC re-run with the new code path.
The REGISTRY coverage table cells are TBD pending that re-run.
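The refit semantics can be sketched as a pairs bootstrap in which the estimation callback is re-run on every draw. Here `att_fn` is a hypothetical stand-in for "resample the panel on the drawn units, re-estimate ω/λ via Frank-Wolfe, compute ATT":

```python
import numpy as np

def refit_bootstrap_se(att_fn, unit_ids, n_bootstrap=200, seed=0):
    """SE of the ATT under a refit pairs bootstrap: every draw resamples
    units with replacement and re-estimates the weights from scratch,
    rather than reusing the fit-time weights."""
    rng = np.random.default_rng(seed)
    atts = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        draw = rng.choice(unit_ids, size=len(unit_ids), replace=True)
        atts[b] = att_fn(draw)  # weight re-estimation happens inside att_fn
    return float(np.std(atts, ddof=1))
```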
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc-only follow-up to the previous commit's bootstrap rewrite. Updates
every user-facing surface that referenced the (now-removed) fixed-weight
bootstrap or the additive bootstrap_refit option:
- docs/choosing_estimator.rst: drops the "Via bootstrap" cell from the
  SDID survey-support row (no SDID variance method supports
  strata/PSU/FPC anymore); rewrites the misdirecting note steering users
  to bootstrap for full survey designs; updates the inference summary
  table description for SDID's variance methods.
- docs/survey-roadmap.md: rewrites the SDID limitations table rows to
  reflect the regression matrix (pweight-only works with placebo /
  jackknife; strata/PSU/FPC has no SDID variance option in this release;
  bootstrap rejects all survey designs).
- docs/performance-scenarios.md: updates the SE-comparison scenario's
  timing expectation note (bootstrap is now ~10-100x slower per fit than
  the previous fixed-weight shortcut).
- docs/tutorials/03_synthetic_did.ipynb: rewrites markdown cells 19
  (inference methods description) and 29 (summary) — bootstrap is now
  paper-faithful refit matching R's default vcov, not the prior
  fixed-weight shortcut.
- docs/tutorials/18_geo_experiments.ipynb: rewrites the
  bootstrap-vs-placebo description (cell t18-cell-028); softens the
  stakeholder narrative claim "the two methods agree" to acknowledge
  that on small panels with non-exchangeable factor structure the SE
  magnitudes can differ while both methods still agree on significance
  and CI direction (cell t18-cell-033); re-executes the comparison cell
  so the output reflects the new bootstrap SE = 4.50 (was 4.26 under
  fixed-weight). The drift-guard asserts at cell t18-cell-026 only pin
  ATT / conf_int / pre-fit RMSE — none of which change — so no guard
  updates are needed.
- diff_diff/synthetic_did.py: the fit() docstring's survey_design
  parameter description is now consistent with the actual guards (no
  bootstrap_refit references; an explicit
  pweight-only-on-placebo-or-jackknife matrix).
- benchmarks/python/coverage_sdid.py: the --help text drops the
  bootstrap_refit mention.
- METHODOLOGY_REVIEW.md: the v3.x SyntheticDiD review entry's claim that
  bootstrap matches R's bootstrap_sample is replaced with an honest
  description of the corrected refit semantics, plus a parenthetical
  historical note about the prior fixed-weight shortcut.

All 351 targeted tests pass; no methodology or numerical change in this
commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh 500 seeds × B=200 run of benchmarks/python/coverage_sdid.py with
the new 3-method ALL_METHODS = ("placebo", "bootstrap", "jackknife").
Total wall-clock ~40 min on M-series Mac, Rust backend.
Numbers match the transitional rename in commit 1 exactly (rej@0.05:
balanced=0.078, unbalanced=0.038, aer63=0.040) — expected bit-identity
since the new "bootstrap" path is the same refit code as the previous
"bootstrap_refit" path, and the MC uses identical seeds (range(500)).
Confirms the rewrite didn't introduce numerical drift.
Headline across the three DGPs:
- bootstrap (refit): near-nominal at α=0.05 (0.078 / 0.038 / 0.040)
versus nominal 0.05; well within 2σ MC band at 500 seeds.
- placebo: also near-nominal (0.060 / 0.032 / 0.058).
- jackknife: mildly anti-conservative on the smaller panels (0.112 on
balanced, 0.080 on AER §6.3), matching the paper's §6.3 pattern.
REGISTRY.md §SyntheticDiD coverage-MC table cells replace the _TBD_
placeholders with the transcribed numbers. The narrative under the
table (`Reading:` paragraph) was already rewritten in commit 1 for the
3-method surface; no further edit needed.
The transitional commit-1 artifact (where I renamed bootstrap_refit →
bootstrap in-place without running the MC) carried a `transitional_note`
field in the metadata; the fresh run naturally drops that field since
it's written from scratch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: ✅ Looks good
CI review on commit 15481d9 flagged the docs as overclaiming parity with
R's default synthdid::vcov(method="bootstrap"): R warm-starts
Frank-Wolfe from the renormalized fit-time ω per draw (and keeps the
fit-time λ as the FW init for the λ re-estimation), while our Python
port was cold-starting from uniform. On the strictly-convex FW objective
with simplex constraint, warm- and cold-start converge to the same
global minimum given enough iterations — but the 100-iter pre-sparsify
pass may not fully converge on some draws, and then sparsification is
path-dependent on the init.

Port the warm-start shape:
- diff_diff/utils.py: compute_sdid_unit_weights and compute_time_weights
  gain an init_weights=None kwarg, forwarded to _sc_weight_fw for the
  first pass. When None (default), the Rust top-level fast-path is
  preserved unchanged. When provided, the call falls through to the
  Python two-pass dispatcher; inner FW calls still dispatch to Rust via
  _sc_weight_fw, so the perf cost is one Python call per pass per draw.
- diff_diff/synthetic_did.py::_bootstrap_se: thread the warm start per
  draw. boot_omega_init = _sum_normalize(unit_weights[boot_control_idx])
  (same shape as R's sum_normalize(weights$omega[sort(ind[ind <= N0])])).
  boot_lambda_init = time_weights (fit-time λ unchanged, matching R's
  weights.boot$lambda = weights$lambda).

Baseline drift is tiny: the TestScaleEquivariance._BASELINE["bootstrap"]
SE shifts from 0.21424970247101688 (cold-start) to 0.21427381053829253
(warm-start), a handful of the 200 bootstrap draws converging to a
different sparsification pattern than the cold-start pass. rel=1e-14
bit-identity on the warm-start capture holds.

Coverage MC: NOT re-run, per user course-correction ("not sure we need
to run the full MC with every pass here"). Math argument:
strictly-convex FW → unique global minimum; aggregated rejection rates
shift well under the 2σ MC noise (≈0.02–0.05 per cell at B=200 × 500
seeds). The committed benchmarks/data/sdid_coverage.json (generated
under cold-start in commit 15481d9) stays accurate for the warm-start
path within MC noise. The REGISTRY.md coverage-MC Note now documents the
artifact regeneration cadence explicitly: regenerate on methodology
changes that materially shift per-draw numerics, not on refactors that
preserve the FW global optimum (warm-start, backend migration, renames,
docstring fixes). Per-seed bit-identity on the captured fixture is the
cheaper, stricter parity check.

Other changes in this commit:
- diff_diff/synthetic_did.py: the replicate-weight guard message is
  rewritten (R2 P3 maintainability): the old message told users to
  switch to a TSL-based survey design with strata/psu/fpc, which is no
  longer supported by any SDID variance method in this release.
- docs/tutorials/18_geo_experiments.ipynb re-executed; the bootstrap SE
  displayed in the comparison cell is now 4.44 (warm-start) vs 4.50
  (cold-start). Narrative unchanged.
- CHANGELOG.md and METHODOLOGY_REVIEW.md describe the warm-start shape
  and drop the "cross-language bit parity" implication from the earlier
  wording.
- The REGISTRY.md bootstrap section describes the warm-start mechanism,
  including the FW strict-convexity argument for why warm and cold start
  converge to the same answer while warm start matches R's procedure at
  the algorithm level.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
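The per-draw warm start can be sketched like this (simplified and illustrative; `sum_normalize` mirrors the shape of R's helper, and the function names are assumptions):

```python
import numpy as np

def sum_normalize(w):
    """Project nonnegative weights onto the simplex by their sum; fall back
    to uniform when the restricted weights are all zero (mirroring the
    shape of R's sum_normalize helper)."""
    w = np.asarray(w, dtype=float)
    total = w.sum()
    return w / total if total > 0 else np.full(w.shape, 1.0 / w.size)

def warm_start_omega(unit_weights, boot_control_idx):
    """Per-draw warm start: restrict the fit-time omega to the resampled
    controls and renormalize, so Frank-Wolfe starts from the fit-time
    solution instead of uniform."""
    return sum_normalize(np.asarray(unit_weights)[boot_control_idx])
```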
/ai-review
🔁 AI review rerun (requested by @igerber)
Addresses CI review R3 findings on PR #351:

P1: rewrite three legacy SDID survey tests in tests/test_survey_phase5.py
that still asserted the old fixed-weight + Rao-Wu success contract so
they match the new upstream guards (bootstrap + any survey design, and
full strata/PSU/FPC with any variance method, now raise
NotImplementedError). test_full_design_bootstrap_smoke ->
test_full_design_bootstrap_raises.
test_full_design_se_differs_from_weights_only is replaced by a positive
placebo pweight-only smoke test, since the old Rao-Wu vs pweight-only SE
comparison no longer has two valid paths to compare.
test_bootstrap_with_survey -> test_bootstrap_with_pweight_only_raises,
plus a new test_jackknife_with_pweight_only for positive coverage. Also
align the pre-existing placebo/jackknife guard tests with the updated
guard wording (the same stale-regex class of issue the P1 flagged).

P2: remove SyntheticDiD from the cross-reference Rao-Wu Rescaled
Bootstrap sections that survived the bootstrap rewrite. The REGISTRY
comparison-table row now describes the alternative as a paper-faithful
refit bootstrap (not fixed weights); the Rao-Wu section lists only
SunAbraham and TROP, with an explicit Note explaining why SDID is
excluded (refit + Rao-Wu composition requires a weighted-FW derivation,
tracked in TODO.md and sketched in the REGISTRY SDID survey-regression
Note). The survey-roadmap.md Phase 6 bullet and the llms-full.txt
survey-aware bootstrap bullet are updated with the same framing.

P3: rewrite the _bootstrap_se docstring (diff_diff/synthetic_did.py) to
say the fit-time unit_weights and time_weights are Frank-Wolfe
warm-start initializations on each draw, not unused parameters retained
for signature parity. The in-loop narrative comment at the warm-start
site was already correct; the top-of-function docstring was stale from
the pre-warm-start commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: ✅ Looks good. No unmitigated P0 or P1 findings
remain in the changed PR surface. The core SDID refit-bootstrap path is
aligned with the updated methodology registry, and the prior re-review
blocker around stale SDID survey tests appears resolved.
- Replace placeholder commit SHAs in TODO.md and REGISTRY.md
  deferred-work breadcrumbs with the actual fixed-weight-removal commit
  hash (91082e5 "Replace SDID fixed-weight bootstrap with paper-faithful
  refit"). The follow-up implementer can now directly run
  `git show 91082e5:diff_diff/synthetic_did.py` to recover the Rao-Wu
  scaffolding (rw_control/rw_treated split, degenerate-retry check,
  treated-mean weighting). Also cite PR #351 for stability against
  commit-hash churn.
- Rename tests/test_survey_phase5.py::test_placebo_full_vs_pweight_only_att_match
  to test_placebo_with_pweight_only_full_design_stripped_att_match and
  promote the body from a single-fit smoke to the ATT-equivalence check
  the docstring claims: fit with two equivalent pweight-only
  SurveyDesign constructions and assert the ATT matches bit-for-bit
  (abs=1e-12). The intent is to confirm strata/psu/fpc columns sitting
  on the DataFrame are not tacitly read unless the SurveyDesign
  references them.
- Fix the stale "3 DGPs × 4 methods" docstring in
  tests/test_methodology_sdid.py::TestCoverageMCArtifact to read
  "3 DGPs × 3 methods", matching the 3-method schema the test now
  enforces (placebo / bootstrap / jackknife; bootstrap_refit and the old
  fixed-weight 'bootstrap' enum are both gone).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: No unmitigated P0 or P1 findings in the changed PR
surface.

Verification note: runtime execution was not possible in the review
environment.
The previous rewrite compared two semantically identical pweight-only
SurveyDesign constructions, which can't catch the contract the docstring
claimed — if a future change silently picked up `stratum` or `psu` by
name, both fits would pick them up identically and the test would still
pass.

Rewrite to compare a fit on the original DataFrame (with `stratum` /
`psu` columns present) against a fit on the same data with those columns
physically dropped. If the estimator ever silently reads those columns
by naming convention, the two fits diverge and the abs=1e-12 ATT check
fails.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
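A sketch of that column-drop contract (pandas-based; `fit_att` is a hypothetical callable standing in for the SDID fit, not the library's API):

```python
import pandas as pd

def att_matches_with_design_columns_dropped(fit_att, df,
                                            design_cols=("stratum", "psu")):
    """Contract check: fitting on the original frame (stratum/psu present)
    and on a copy with those columns physically dropped must give the same
    ATT, proving the estimator never reads them by naming convention."""
    att_with = fit_att(df)
    att_without = fit_att(df.drop(columns=list(design_cols)))
    return abs(att_with - att_without) < 1e-12
```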
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: ✅ Looks good. No unmitigated P0/P1 findings in the
changed PR surface; the remaining issues are P2/P3 only. Runtime
verification was static-only.
Prior behavior: ``_bootstrap_se`` tallied Frank-Wolfe non-convergence
via ``warnings.catch_warnings``, but the Rust FW entry point is silent
on ``max_iter`` exhaustion (only the pure-NumPy path called
``warn_if_not_converged``). On the default Rust backend the aggregate
warning at the end of the bootstrap loop therefore never fired, even
when draws did not converge — a silent failure.

Fix: thread an explicit convergence bool out of the Rust solver.

Rust (``rust/src/weights.rs``, ``rust/src/lib.rs``):
- ``sc_weight_fw_gram`` / ``sc_weight_fw_standard`` now set and return
  ``converged = true`` on a min-decrease break, ``false`` otherwise.
- ``sc_weight_fw_internal`` returns ``(Array1<f64>, bool)``.
- The existing ``sc_weight_fw`` pyfunction destructures and drops the
  bool, preserving its ABI for the rank-selection heuristic in
  ``prep.py`` and for any third-party consumer.
- A new pyfunction ``sc_weight_fw_with_convergence`` returns the
  ``(array, bool)`` tuple, wrapping the same internal solver.
- Internal helpers ``compute_time_weights_internal`` /
  ``compute_sdid_unit_weights_internal`` destructure the inner calls and
  still return ``Array1<f64>`` (their pyfunctions discard convergence —
  Python callers that need it use the Python two-pass dispatcher).

Python (``diff_diff/utils.py``, ``diff_diff/_backend.py``):
- Import the new Rust entry point as
  ``_rust_sc_weight_fw_with_convergence``.
- ``_sc_weight_fw`` / ``_sc_weight_fw_numpy`` gain a
  ``return_convergence=False`` kwarg. The default path is unchanged;
  with the flag set, they return ``(weights, converged)``.
- ``compute_sdid_unit_weights`` / ``compute_time_weights`` gain the same
  kwarg and propagate the AND of the two FW passes (pre-sparsify +
  main). When the flag is set, the Rust top-level fast-path is skipped
  (it is silent on non-convergence) in favor of the Python two-pass
  dispatcher; inner FW calls still dispatch to the Rust solver via
  ``sc_weight_fw_with_convergence``.

Consumers (``diff_diff/synthetic_did.py::_bootstrap_se``):
- Replace the ``warnings.catch_warnings`` dance with an explicit
  ``return_convergence=True`` call on both helpers.
  ``fw_nonconvergence_count`` now increments based on the AND of
  ``omega_converged`` and ``lambda_converged`` per draw — this works
  under both the Rust and pure-Python backends.

Regression test (``tests/test_methodology_sdid.py::TestBootstrapSE::
test_bootstrap_fw_nonconvergence_warning_fires_under_rust``):
- Monkeypatches ``_rust_sc_weight_fw_with_convergence`` to always report
  ``converged=False`` and asserts the aggregate FW non-convergence
  ``UserWarning`` fires on an otherwise-nominal SDID bootstrap fit.
  Skipped when the Rust backend is unavailable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
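A pure-NumPy sketch of the flag's shape (a simplified Frank-Wolfe, not the Rust solver; the ``return_convergence`` kwarg mirrors the one described above, and the default return preserves the weights-only ABI):

```python
import numpy as np

def sc_weight_fw(Y, target, max_iter=200, min_decrease=1e-8,
                 init=None, return_convergence=False):
    """Frank-Wolfe on the simplex for min_w ||Y @ w - target||^2.

    `converged` is True iff the min-decrease break fired (mirroring the
    Rust-side flag); False means max_iter was exhausted."""
    n = Y.shape[1]
    w = np.full(n, 1.0 / n) if init is None else np.asarray(init, dtype=float)
    objective = lambda v: float(np.sum((Y @ v - target) ** 2))
    prev, converged = objective(w), False
    for t in range(max_iter):
        grad = 2.0 * Y.T @ (Y @ w - target)
        vertex = np.zeros(n)
        vertex[int(np.argmin(grad))] = 1.0        # linear-minimization step
        candidate = w + (2.0 / (t + 2.0)) * (vertex - w)
        current = objective(candidate)
        if prev - current < min_decrease:
            converged = True                      # min-decrease break
            break
        w, prev = candidate, current
    return (w, converged) if return_convergence else w
```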
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: ✅ Looks good. No unmitigated P0/P1 findings in the
changed PR surface. On this re-review, the earlier Rust non-convergence
warning issue and the BR inference-label concern both look addressed;
the remaining issues are P2/P3 only.

Methodology: affected method is the SyntheticDiD bootstrap variance,
cross-checked across docs/methodology/REGISTRY.md:1497,
diff_diff/synthetic_did.py:948, diff_diff/synthetic_did.py:1053, and
diff_diff/synthetic_did.py:644.
P2 (Documentation/Tests) — ``test_bootstrap_p_value_null_calibration``
at ``tests/test_methodology_sdid.py:2604`` was still framed around the
deleted fixed-weight regime and encoded an anti-conservative lower bound
``rejection_rate > 0.05`` that biased the test toward keeping
over-rejection. Rename it to ``test_bootstrap_p_value_null_dispersion``
and replace the directional calibration assertion with a
calibration-agnostic dispersion guard: ``np.std(p_values) > 0.10``
catches the pre-fix p-clustering dispatch bug (p≈0.5 on every seed under
the old bug), and a loose band ``0.01 <= rejection_rate <= 0.40``
catches both SE collapse and SE explosion. This matches this session's
commit c0d089b by removing the remaining fixed-weight framing in the
test suite. Tested locally: a 100-seed refit bootstrap passes with a
comfortable margin.

P3 (newly identified) — the REGISTRY.md §Frank-Wolfe non-convergence
edge case (line 1528) still said the Rust backend silently returns the
final iterate and cited a follow-up. This session's commit c0d089b
already landed the Rust-side convergence flag, the Python
``return_convergence=True`` kwargs on the weight helpers, and the
aggregate ``UserWarning`` in ``_bootstrap_se``. Update the bullet to
describe the new behavior: the bootstrap path now uses
``sc_weight_fw_with_convergence`` (Rust) / numpy
``return_convergence=True`` and aggregates a single summary warning
above the 5% threshold; standalone callers that do not opt into
convergence tracking retain the legacy silent-Rust / warning-numpy
asymmetry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
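The dispersion guard's logic can be sketched as a small helper (illustrative; the real test asserts these conditions inline):

```python
import numpy as np

def null_dispersion_guard(p_values, alpha=0.05, min_std=0.10,
                          band=(0.01, 0.40)):
    """Calibration-agnostic H0 guard: p-values must be dispersed (catches a
    p≈0.5-on-every-seed clustering bug) and the rejection rate must land
    in a loose band (catches both SE collapse and SE explosion)."""
    p = np.asarray(p_values, dtype=float)
    rejection_rate = float(np.mean(p < alpha))
    return float(np.std(p)) > min_std and band[0] <= rejection_rate <= band[1]
```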
/ai-review
🔁 AI review rerun (requested by @igerber)

Overall Assessment: ✅ Looks good. No unmitigated P0/P1 findings in the
changed PR surface; the SDID refit-bootstrap implementation looks
methodologically consistent with the cited SDID paper.

Methodology: the changed bootstrap path in
diff_diff/synthetic_did.py:L590-L657, diff_diff/synthetic_did.py:L830-L1059,
diff_diff/utils.py:L1301-L1726, and rust/src/weights.rs:L125-L558
re-estimates both weights per draw.

Code Quality, Performance, Security: No findings.
Five P3-only items from R8 CI review:
1. Correctly attribute R's default `vcov()` method:
- diff_diff/synthetic_did.py:53 docstring previously claimed placebo was
"R's default". R's `synthdid::vcov()` actually defaults to
`method="bootstrap"`. Reword to describe placebo as the library default
with a rationale paragraph (survey availability, perf) and cross-
reference to the REGISTRY Note below.
- METHODOLOGY_REVIEW.md item 5 said the same incorrect thing. Rewrite
to frame the default as a deliberate library deviation with the same
two-reason rationale.
2. Add a REGISTRY.md Note (default variance_method deviation from R)
that documents the rationale explicitly (survey availability + perf)
so the AI reviewer recognizes the deviation as documented rather than
as an accidental contradiction between the docstring and R's actual
default.
3. Soften the placebo-failure fallback guidance in
`_placebo_variance_se`: the previous strings recommended
`variance_method="bootstrap"`, which now raises `NotImplementedError`
on every survey design. Branch on `w_control is not None` (survey
fit) to recommend jackknife + adding controls for survey users and
keep the bootstrap + jackknife + more-controls recommendation for
non-survey users.
4. Fix the REGISTRY pointer to the slow dispersion guard: reference
`TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (the
current name) instead of the pre-rename
`test_bootstrap_p_value_null_calibration`, and describe the new
contract (calibration-agnostic dispersion + loose rejection-rate
band).
5. Rephrase the REGISTRY coverage-MC narrative about jackknife
anti-conservatism: the paper's AER §6.3 shows mixed jackknife
evidence (98% iid — slightly conservative; 93% AR(1) — slightly
anti-conservative). Our observed anti-conservatism aligns with the
AR(1) branch of the paper's evidence, not a uniform "in line"
pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
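Item 3's branch could look like the following sketch; the helper name and message strings are illustrative, not the library's actual wording:

```python
def placebo_failure_guidance(w_control):
    """Pick the fallback hint after a failed placebo variance fit (sketch).

    w_control is non-None exactly when the estimator was fit with survey
    weights, in which case bootstrap raises NotImplementedError and must
    not be recommended.
    """
    if w_control is not None:  # survey fit
        return ("Placebo variance failed. Consider variance_method='jackknife' "
                "or adding more control units (bootstrap is not supported for "
                "survey designs).")
    return ("Placebo variance failed. Consider variance_method='bootstrap' or "
            "'jackknife', or adding more control units.")
```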
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Overall Assessment ✅ Looks good.
Executive Summary
Methodology: No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:L834-L1063, together with the warm-start / convergence plumbing in diff_diff/utils.py:L1301-L1715 and rust/src/weights.rs:L121-L555, is consistent with the methodology registry in docs/methodology/REGISTRY.md:L1497-L1552 and with the official
Code Quality: No findings.
Performance: No code-level findings.
Maintainability: No findings.
Tech Debt
Security: No findings.
Documentation/Tests
|
Single actionable P3 from R9 CI review: user-facing runtime wording for the refit bootstrap had diverged across surfaces, giving conflicting expectations about the cost of the new bootstrap path:
- CHANGELOG.md and diff_diff/synthetic_did.py said ~5-30x slower.
- diff_diff/power.py said ~10-100x slower (two sites).
- docs/choosing_estimator.rst said ~10-100x slower.
- docs/performance-scenarios.md said ~10-100x slower.
- docs/methodology/REGISTRY.md coverage-MC block said ~10-100x slower.
- docs/tutorials/03_synthetic_did.ipynb and docs/tutorials/18_geo_experiments.ipynb said ~10-100x slower.
- benchmarks/python/coverage_sdid.py said the 500-seed MC run takes ~2-4 hours, while REGISTRY.md said ~15-40 min (the actually-observed wall-clock; aer63 is ~37 min, balanced + unbalanced ~2 min combined).

Unify on "~5-30x slower than placebo (panel-size dependent)" for the per-fit slowdown (the warm-start plumbing closed the gap vs the pre-warm-start cold-start estimate of 10-100x) and on "~15-40 min" for the coverage-MC wall-clock. The CHANGELOG entry already notes the 10-100x figure as a historical "prior estimate" — left as-is so the release notes continue to explain the revision.

Also fix two tutorial surfaces that still called placebo "R's default" (tutorial 03, sections 7 and 10). R's default is bootstrap; placebo is the library default per the REGISTRY Note added in 710f966. Reword to describe placebo as the library default with the rationale pointer.

Verified: 353 tests pass across test_methodology_sdid, test_power, test_guides (UTF-8 fingerprint preserved). Tutorial-18 nbmake drift guards are unaffected because the change is markdown-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Overall Assessment ✅ Looks good. Static review only; I could not execute tests in this workspace.
Executive Summary
Methodology: No unmitigated findings. The refit bootstrap path in diff_diff/synthetic_did.py:594, diff_diff/synthetic_did.py:846, and docs/methodology/REGISTRY.md:1497 is consistent with the source-material contract: official
Code Quality: No findings.
Performance: No findings.
Maintainability: No findings.
Tech Debt
Security: No findings.
Documentation/Tests
|
The Pure Python Fallback CI job failed this test at rel-diff 0.5310 > 0.40 tolerance. The root cause is test infrastructure, not a correctness regression: ci_params.bootstrap(min_n=...) silently caps min_n at 49 in pure-Python mode to keep CI fast (see tests/conftest.py:210), while the test's 0.40 tolerance was explicitly calibrated for B ∈ [100, 200] per its docstring comment. At B=49 the bootstrap SE has not yet converged to the placebo SE (rel-diff 0.5310 at B=49; 0.3856 at B=100; 0.2708 at B=200 on the same seed), so the failure is MC noise, not a regression.

The 15 Rust-backed matrix jobs (macOS / Linux x86 / Linux ARM / Windows × 3 Python versions) all run the test at the full B=200 and pass with a comfortable margin, so the regression guard is still exercised on the default user install path. Skip under pure-Python mode with an explicit rationale citing the min_n cap and the Rust-backed coverage that preserves the contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
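The skip described above might look like the following sketch, using stdlib `unittest.SkipTest` (which pytest also honors as a skip); the flag and threshold names are assumptions, not the repo's actual fixtures:

```python
import unittest

def skip_if_pure_python_undersized(has_rust_backend, n_bootstrap, calibrated_min=100):
    """Skip the rel-diff regression guard when the pure-Python CI cap makes
    it meaningless (sketch). The 0.40 tolerance was calibrated for
    B in [100, 200]; pure-Python CI mode caps B at 49."""
    if not has_rust_backend and n_bootstrap < calibrated_min:
        raise unittest.SkipTest(
            f"bootstrap B={n_bootstrap} is below the tolerance-calibrated minimum "
            f"of {calibrated_min} (pure-Python min_n cap); the Rust-backed matrix "
            "jobs run B=200 and preserve the regression contract"
        )
```

Raising the skip exception (rather than silently returning) keeps the rationale visible in the CI report, matching the "skip with an explicit rationale" intent above.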
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Overall Assessment ✅ Looks good. Static diff review only.
Executive Summary
Methodology: No unmitigated findings. The changed refit bootstrap path at diff_diff/synthetic_did.py:846 is consistent with the updated methodology registry at docs/methodology/REGISTRY.md:1497, and bootstrap/jackknife inference still routes through
Code Quality
Performance: No findings. The refit-bootstrap slowdown is explicit and documented, not an accidental regression. See CHANGELOG.md:15.
Maintainability: No findings.
Tech Debt
Security: No findings.
Documentation/Tests
|
R10 CI review found two items on top of the previous ✅ Looks good.

P2 Code Quality — aggregate Frank-Wolfe non-convergence warning numerator/denominator mismatch. In ``_bootstrap_se``, ``fw_nonconvergence_count`` was incremented before the draw cleared the ``np.isfinite(tau)`` gate. A draw that failed FW convergence AND then produced a non-finite τ would count toward the warning numerator while the denominator is ``n_successful`` (draws that cleared the finite-τ gate). That does not affect the reported SE, but it can overstate the documented "share of valid bootstrap draws" warning contract and cause the warning to over-trigger. Fix: move the increment inside the ``if np.isfinite(tau)`` block so the numerator only counts draws that also contribute to the SE. A draw failing the finite-τ gate is retried upstream and should not inflate the non-convergence rate.

P3 Documentation (previously unresolved) — two survey cross-reference docs still advertised SyntheticDiD Rao-Wu bootstrap support, which the estimator now rejects at fit-time with NotImplementedError:
- ``docs/methodology/survey-theory.md:725`` — rewrite the Rao-Wu bullet to exclude SDID explicitly, with a pointer to the REGISTRY sketch for the deferred weighted-FW composition and to pweight-only placebo/jackknife as the available SDID variance alternatives.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35-f1ef376c — update the support-matrix table so SDID's row reads "pweight only (placebo / jackknife)" with bootstrap struck out, and add a "Note on SyntheticDiD" below explaining which methods accept pweight-only and why bootstrap rejects all survey designs (weighted-FW derivation tracked in TODO.md).

Test coverage unchanged: TestBootstrapSE ran the full 7 under Rust, with the 48-of-50 non-convergence warning still firing on the regression test, confirming the warning numerator still tallies correctly after the gate-order change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
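The gate-order fix can be pictured with a small standalone tally; the per-draw refit itself is elided, with `draw_results` standing in for `(τ, converged)` pairs, and the function name is illustrative rather than the library's `_bootstrap_se`:

```python
import numpy as np

def tally_refit_draws(draw_results):
    """Align the non-convergence numerator with the finite-tau denominator
    (sketch of the corrected ordering)."""
    taus, nonconvergence = [], 0
    for tau, converged in draw_results:
        if np.isfinite(tau):          # finite-tau gate comes first
            taus.append(tau)
            if not converged:         # numerator only counts contributing draws
                nonconvergence += 1
        # non-finite draws are retried upstream; they enter neither tally
    n_successful = len(taus)
    rate = nonconvergence / n_successful if n_successful else float("nan")
    return np.asarray(taus), rate
```

A draw like `(nan, False)` no longer inflates the rate: `tally_refit_draws([(1.0, True), (float("nan"), False), (2.0, False)])` yields a rate of 0.5, where the pre-fix ordering would have reported 1.0.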
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Overall Assessment ✅ Looks good. Static diff review only.
|
R11 CI review flagged that compute_time_weights and compute_sdid_unit_weights describe the new return_convergence parameter but their Returns sections still advertise only np.ndarray, not the tuple return.

Update both docstrings to mirror _sc_weight_fw's union-return shape: Returns now reads "np.ndarray or Tuple[np.ndarray, bool]" with an explicit description of the two-pass AND convergence flag (True iff both the pre-sparsify and main FW passes hit the min-decrease break; False if either hit max_iter). Also promote the return_convergence kwarg description from its implementation-detail placement to the main Parameters block with the same contract note as _sc_weight_fw, so standalone callers see the legacy-ABI preservation guarantee and the Rust top-level fast-path skip.

Docs-only change; no code or test changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
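The documented union-return contract amounts to the following shape; the solver body here is a uniform-weights stand-in (only the return shape reflects the contract above, and the function name is hypothetical):

```python
import numpy as np

def compute_weights_example(Y, return_convergence=False):
    """Illustrate the "np.ndarray or Tuple[np.ndarray, bool]" Returns
    contract (sketch; the real helpers run a two-pass sparsified FW solve)."""
    weights = np.full(Y.shape[0], 1.0 / Y.shape[0])
    converged = True  # would be the AND of the pre-sparsify and main FW passes
    if return_convergence:
        return weights, converged  # opt-in tuple return
    return weights                 # legacy ABI: bare ndarray
```

Defaulting the kwarg to False is what preserves the legacy ABI: existing callers keep receiving a bare ndarray and never see the tuple.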
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Overall Assessment ✅ Looks good. Static diff review only.
Executive Summary
Methodology: No findings. The changed estimator math, SE formula, retry-to-
Code Quality: No findings.
Performance: No findings. The slowdown from fixed-weight bootstrap to per-draw refit is explicit and documented rather than silent.
Maintainability: No findings. Extracting
Tech Debt
Security: No findings.
Documentation/Tests
|
Foundation for restoring SDID survey-bootstrap support (PR #352, follow-up to #351 which front-door rejected all survey designs). This commit adds the weighted-FW kernel + Python wrappers; the bootstrap integration lands in the next commit.

Rust (rust/src/weights.rs, rust/src/lib.rs):
- New `sc_weight_fw_gram_weighted` and `sc_weight_fw_standard_weighted` loop variants. Identical to the unweighted loops except for the regularization term: `half_grad[j]` picks up `eta*reg_w[j]*lam[j]` in place of `eta*lam[j]`, and the FW step-size denominator uses the diag(reg_w)-weighted simplex direction norm `Σ_j reg_w[j]*d[j]²` (which simplifies to `Σ_j reg_w[j]*lam[j]² + reg_w[i] - 2*reg_w[i]*lam[i]` for d = e_i - lam).
- New `sc_weight_fw_weighted_internal` dispatcher that delegates to the unweighted internal when reg_weights is None (preserves the legacy numeric contract for any future caller that wants the generic shape).
- Two new pyfunctions: `sc_weight_fw_weighted` and `sc_weight_fw_weighted_with_convergence`. Same call shape as the existing unweighted siblings plus a trailing `reg_weights` kwarg. Registered in lib.rs.
- 3 new Rust unit tests in rust/src/weights.rs:
  * test_weighted_fw_reg_weights_none_delegates — bit-identity at rel=1e-14 against the unweighted internal.
  * test_weighted_fw_uniform_reg_weights_matches_unweighted — uniform rw=1 collapses to uniform regularization (rel=1e-12, allowing for ULP-scale drift from different float reduction orders).
  * test_weighted_fw_simplex_invariants — for arbitrary positive rw and both gram (T0<N) and standard (T0>=N) paths, the returned ω sums to 1 and is non-negative.

Python (diff_diff/utils.py, diff_diff/_backend.py):
- Export _rust_sc_weight_fw_weighted and _with_convergence from _backend (mirrors the shape added for _rust_sc_weight_fw_with_convergence in PR #351 c0d089b).
- Extend `_sc_weight_fw` and `_sc_weight_fw_numpy` with a `reg_weights: Optional[np.ndarray] = None` kwarg. When set on the Rust path, dispatches to the new weighted pyfunctions; on the pure-Python path, runs a weighted FW loop mirroring the Rust derivation.
- New helper `compute_sdid_unit_weights_survey(Y_pre_control, Y_pre_treated_mean, rw_control, ...)`: column-scales Y_pre_control by rw_control and passes rw_control as reg_weights so the FW solves the unit-weight survey-bootstrap objective min_{ω simplex} Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - treated_pre[t])² + ζ²·Σ_i rw_i·ω_i². The two-pass sparsify-refit structure mirrors compute_sdid_unit_weights. Returns ω on the standard simplex (the caller composes ω_eff downstream).
- New helper `compute_time_weights_survey(Y_pre_control, Y_post_control, rw_control, ...)`: row-scales Y_time by sqrt(rw_control) and passes no reg_weights (uniform reg on λ — λ is per-period, rw is per-control, so there is no alignment for per-λ weighting). Two-pass structure unchanged.
- Both new helpers expose `return_convergence=True`, returning the AND of the two pass convergence flags, mirroring the contract added in PR #351 c0d089b.

Tests (tests/test_weighted_fw.py — new, 15 tests):
- _sc_weight_fw weighted-reg path: reg_weights=None matches unweighted at bit-identity; uniform reg matches unweighted at rel=1e-12; Rust/numpy parity at rel=1e-9; simplex invariants under arbitrary rw; return_convergence tuple shape.
- compute_sdid_unit_weights_survey: uniform-rw equivalence to the unweighted helper, simplex invariants under arbitrary rw, shape-mismatch raises, return_convergence AND.
- compute_time_weights_survey: same coverage matrix, plus a zero-rw subset test (a Rao-Wu-style undrawn PSU yields a valid simplex λ).
- Backend parity: pure-Python vs Rust weighted-helper output at rel=1e-7 for both unit and time helpers (monkeypatches HAS_RUST_BACKEND).

ABI preservation: existing unweighted callers of _sc_weight_fw, compute_sdid_unit_weights, compute_time_weights are unaffected — the new kwarg defaults to None and dispatches to the legacy code path. The bit-identity check on TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap] still passes at rel=1e-14 (verified in the next commit when the bootstrap integration lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
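The reg-weighted loop change above can be illustrated with a minimal numpy Frank-Wolfe on the simplex for min_x ||Ax − b||² + η·Σ_j rw_j·x_j². This is a from-scratch sketch of the derivation (gradient picks up η·rw_j·x_j; the exact line search denominator picks up the diag(rw)-weighted direction norm), not the library's kernel, which additionally carries the two-pass sparsify-refit structure:

```python
import numpy as np

def fw_simplex_weighted(A, b, eta, reg_w, n_iter=500, tol=1e-12):
    """Frank-Wolfe with exact line search for a reg-weighted quadratic
    over the probability simplex (illustrative sketch)."""
    n = A.shape[1]
    lam = np.full(n, 1.0 / n)                      # uniform simplex init
    for _ in range(n_iter):
        # gradient of ||A lam - b||^2 + eta * sum(reg_w * lam^2)
        grad = 2.0 * A.T @ (A @ lam - b) + 2.0 * eta * reg_w * lam
        i = int(np.argmin(grad))                   # FW vertex e_i
        d = -lam.copy()
        d[i] += 1.0                                # direction e_i - lam
        # exact line search: denominator carries the weighted direction norm
        denom = float((A @ d) @ (A @ d) + eta * np.sum(reg_w * d * d))
        if denom <= 0.0:
            break
        step = min(1.0, max(0.0, float(-(grad @ d)) / (2.0 * denom)))
        if step * float(np.abs(d).max()) < tol:
            break
        lam = lam + step * d                       # stays on the simplex
    return lam
```

Because each iterate is a convex combination of the previous iterate and a vertex, the simplex invariants (non-negative, sums to one) hold by construction, and the clamped exact line search never increases the objective.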
…sition PR #352 restores the SDID survey-bootstrap capability that PR #351 front-door rejected as a known regression. Pweight-only and full-design surveys now both succeed; placebo / jackknife continue to reject full designs (a separate methodology gap tracked in TODO.md).

`diff_diff/synthetic_did.py::fit` (guards):
- Replace the unconditional strata/PSU/FPC NotImpl guard with a method-gated version that fires only for placebo / jackknife. Rationale + truth-table in REGISTRY.md §SyntheticDiD survey-support matrix:

  | method    | pweight-only | strata/PSU/FPC           |
  |-----------|--------------|--------------------------|
  | bootstrap | ✓ (this PR)  | ✓ Rao-Wu (this PR)       |
  | placebo   | ✓ unchanged  | ✗ NotImpl (separate gap) |
  | jackknife | ✓ unchanged  | ✗ NotImpl (separate gap) |

- Delete the unconditional `bootstrap + any-survey` guard added in #351. Keep the `weight_type != "pweight"` validation (fweight/aweight still rejected).

`diff_diff/synthetic_did.py::fit` (survey resolution):
- After validating the per-unit survey weights (`w_treated`, `w_control`), also collapse the observation-level `resolved_survey` to a unit-level view via `collapse_survey_to_unit_level(...)` ordered as `[*control_units, *treated_units]`. The resulting `resolved_survey_unit` is what `_bootstrap_se` slices via `boot_rw[:n_control]` / `boot_rw[n_control:]` per Rao-Wu draw.

`diff_diff/synthetic_did.py::fit` (dispatcher):
- Branch the bootstrap call on whether the design is pweight-only or a full design (strata/PSU/FPC). Pass `w_control`/`w_treated` for pweight-only, `resolved_survey=resolved_survey_unit` for full design, None/None for non-survey.

`diff_diff/synthetic_did.py::_bootstrap_se`:
- New kwargs: `w_control`, `w_treated`, `resolved_survey` (all keyword-only, default None — preserves the legacy signature).
- Single-PSU short-circuit: an unstratified survey with <2 PSUs returns (NaN, []) since the bootstrap distribution is unidentified (resampling one PSU yields the same subset every draw). Recovered from the pre-PR-#351 fixed-weight Rao-Wu branch (commit 91082e5).
- Per-draw Rao-Wu rescaling for full designs: ``rw = generate_rao_wu_weights(resolved_survey, rng)`` sliced over the resampled units. The pweight-only path uses ``rw = w_control[boot_idx]`` (constant per draw, no rescaling).
- Survey-weighted treated-unit means: ``np.average(..., weights=rw_treated_draw)`` when survey weights are present.
- Warm-start: the simplex init scales by rw before sum_normalize when on the survey path, matching the per-draw weighted-FW geometry.
- Per-draw FW dispatch: survey paths call the new ``compute_sdid_unit_weights_survey`` / ``compute_time_weights_survey`` helpers (PR #352 commit 1), which run the weighted-FW kernel; non-survey paths continue to call the unweighted helpers (bit-identity preserved on the non-survey refit path).
- Post-FW composition: ``ω_eff = rw·ω / Σ(rw·ω)`` for the SDID estimator (which expects simplex weights). Degenerate-retry if ``Σ(rw·ω) <= 0`` (all mass on rw=0 controls).
- Aggregate FW non-convergence warning: the tally is the AND of the two helpers' convergence flags per draw and fires above 5% (PR #351 c0d089b shape preserved, no copy change).

Tests:
- ``tests/test_survey_phase5.py``: rewrite three PR #351 raises-tests as succeeds-tests with explicit SE assertions —
  * ``test_full_design_bootstrap_succeeds`` (was ``_raises``): finite SE, populated survey_metadata.n_strata/n_psu, summary() includes Survey Design + Bootstrap replications blocks.
  * ``test_bootstrap_with_pweight_only_succeeds`` (was ``_raises``): finite SE, variance_method preserved (cross-surface guard).
  * New ``test_bootstrap_full_design_se_differs_from_pweight_only`` resurrects the PR #351 R3-deleted differs-from contract: ATT matches between paths (both compose ω_eff post-fit) but SE differs (Rao-Wu adds PSU clustering variance).
- ``tests/test_methodology_sdid.py::TestBootstrapSE``: rewrite two PR #351 raises-tests as succeeds-tests, plus add the ``test_bootstrap_single_psu_returns_nan`` short-circuit regression.

Verified: 308 tests pass across test_methodology_sdid / test_business_report SDID subset / test_rust_backend / test_survey_phase5 / test_weighted_fw / test_guides. Bit-identity check: the non-survey refit path goes through the unweighted helpers (no weighted-FW dispatch), so ``TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]`` remains at rel=1e-14 — verified passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
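The post-FW composition step above amounts to a one-line renormalization plus the degenerate-retry guard; a sketch with a hypothetical helper name:

```python
import numpy as np

def compose_effective_weights(rw, omega):
    """Fold per-draw survey rescaling weights rw into the simplex solution
    omega, per omega_eff = rw*omega / sum(rw*omega) (sketch).

    Returns None on the degenerate all-mass-on-rw=0 case, which the
    bootstrap loop treats as a retry rather than a valid draw.
    """
    mass = float(np.sum(rw * omega))
    if mass <= 0.0:
        return None
    return rw * omega / mass   # back on the simplex for the SDID estimator
```

For example, with `rw = [1, 1, 0]` and `omega = [0.5, 0.25, 0.25]`, the undrawn third control is zeroed out and the remaining mass renormalizes to `[2/3, 1/3, 0]`.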
Capstone of PR #352. Validates the new weighted-FW + Rao-Wu bootstrap composition and propagates the landed capability across the documentation surfaces.

Coverage MC harness (benchmarks/python/coverage_sdid.py):
- Add ``stratified_survey`` as a 4th DGP in ``ALL_DGPS``. Uses ``generate_survey_did_data`` to produce an N=40 (strata=2, PSU=2/stratum) null-treatment panel with moderate weight variation and modest ICC (``psu_re_sd=1.5``). Cohort 7 → post = 7..11 (5 post periods). Converts per-observation ``treated`` to a unit-level ever-treated indicator (SDID's block-treatment requirement).
- Extend ``DGPSpec`` with an optional ``survey_design_factory`` callable that returns ``(SurveyDesign, supported_methods_tuple)``. For ``stratified_survey``: bootstrap only — placebo / jackknife reject strata/PSU/FPC at fit-time, so the harness skips them rather than catching the NotImplementedError inside ``_fit_one``.
- ``_fit_one`` gains an optional ``survey_design`` kwarg routed through ``SyntheticDiD.fit(survey_design=)``. ``_run_dgp`` calls the factory once per seed (DataFrame contents don't affect columns) and gates methods on the supported set.

Regenerated ``benchmarks/data/sdid_coverage.json`` via ``python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200``. Total wall-clock 2421 s (~40 min on an M-series Mac, Rust backend); aer63 remains the long tail at 2237 s, stratified_survey adds only 33 s.

Calibration gate (plan §2.7): ``stratified_survey × bootstrap`` at α=0.05 returns 0.042 (500 seeds × B=200), inside the calibration band [0.02, 0.10]. ``mean SE / true SD = 1.25`` indicates the bootstrap is slightly conservative (it overestimates the empirical sampling SD by ~25%) — the safer direction under Rao-Wu rescaling with only 4 PSUs total. Validates the weighted-FW + Rao-Wu composition end-to-end.

REGISTRY.md §SyntheticDiD:
- Add a ``stratified_survey`` row to the coverage MC table and a paragraph under it documenting the calibration verdict, the conservatism direction, and why the placebo/jackknife rows are NaN.
- Replace the survey-support bullet with a truth-table matrix (PR #352 shape); add a ``Note (survey + bootstrap composition)`` documenting the weighted-FW objective (unit and time forms), the ω_eff composition, the argmin-set caveat, the per-draw rw dispatch (pweight-only vs Rao-Wu), and the single-PSU short-circuit.
- Update the ``Note (default variance_method deviation from R)`` to drop the "bootstrap rejects surveys" framing (no longer accurate).
- Update the ``Note (coverage Monte Carlo calibration)`` header to say "4 representative null-panel DGPs" and flag stratified_survey as bootstrap-only.

User-facing docs:
- ``docs/methodology/survey-theory.md``: restore SDID in the Rao-Wu Rescaled Bootstrap list; describe the weighted-FW composition.
- ``docs/survey-roadmap.md``: Phase 5 SDID row updated to reflect full-design bootstrap support via PR #352; Phase 6 Rao-Wu bullet restores SDID.
- ``docs/tutorials/16_survey_did.ipynb`` cell-35: the support-matrix table row for SyntheticDiD switches from "pweight only (placebo/jackknife)" to "bootstrap only (PR #352) for strata/PSU/FPC"; the "Note on SyntheticDiD" block is rewritten for the landed contract.
- ``diff_diff/synthetic_did.py`` ``__init__`` docstring: the bootstrap bullet now describes survey support and the ω_eff composition.
- ``diff_diff/guides/llms-full.txt``: the survey-aware bootstrap bullet includes SDID in the Rao-Wu list with the weighted-FW formula.

CHANGELOG.md:
- Retain the PR #351 regression Changed entry but annotate it as "restored in PR #352"; add new Added/Changed PR #352 entries documenting the weighted-FW kernel, survey helpers, _bootstrap_se Rao-Wu composition, and the new coverage MC row.

TODO.md:
- Row 103 (SDID + survey designs) → closed by PR #352; replaced with a narrower follow-up for placebo/jackknife + strata/PSU/FPC (Low priority, no concrete sketch yet).

Tests:
- ``TestCoverageMCArtifact`` extended: 4 DGPs asserted (including ``stratified_survey``); new explicit assertions that the stratified_survey bootstrap row has ≥100 successful fits and α=0.05 rejection ∈ [0.02, 0.10]; placebo/jackknife rows have n_successful_fits == 0 (the strata/PSU/FPC rejection contract).

Verified: TestCoverageMCArtifact passes against the regenerated artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
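The extended artifact assertions above might look like the following sketch. The JSON key names (`dgps`, `results`, `n_successful_fits`, `rejection_rate`) are assumptions about the artifact layout; only the asserted contract comes from the commit:

```python
def check_stratified_survey_rows(artifact):
    """Smoke-check the stratified_survey contract in the coverage artifact
    (hypothetical schema: artifact["results"][dgp][method] rows)."""
    assert len(artifact["dgps"]) == 4
    assert "stratified_survey" in artifact["dgps"]
    rows = artifact["results"]["stratified_survey"]
    # bootstrap is the only supported method for the full survey design
    assert rows["bootstrap"]["n_successful_fits"] >= 100
    assert 0.02 <= rows["bootstrap"]["rejection_rate"]["0.05"] <= 0.10
    for method in ("placebo", "jackknife"):  # strata/PSU/FPC rejection contract
        assert rows[method]["n_successful_fits"] == 0
```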
Summary
- `variance_method="bootstrap_refit"` for SyntheticDiD — Arkhangelsky et al. (2021) Algorithm 2 step 2. Re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe on each pairs-bootstrap draw, using the fit-time normalized-scale zeta. Opt-in; default remains `"placebo"`.
- Cross-surface allow-list extensions for the new `variance_method`: `results.py:960` summary line, `business_report.py:602` BR inference-label, `synthetic_did.py:695` `n_bootstrap` result population, `power.py` SDID guidance strings, `SyntheticDiD.__init__` docstring, `diff_diff/guides/llms-full.txt`.
- Survey + refit raises `NotImplementedError` upstream in `fit()` — the Rao-Wu rescaled-weight composition is tracked as a follow-up TODO.
- Coverage MC study (`benchmarks/python/coverage_sdid.py`) runs 500 seeds × B=200 × 3 DGPs × 4 methods under H0 and writes `benchmarks/data/sdid_coverage.json`. Rejection rates at α ∈ {0.01, 0.05, 0.10} and mean-SE / true-SD ratios are transcribed into `REGISTRY.md` §SyntheticDiD. Headline: refit achieves near-nominal calibration (rej@0.05 ≈ 0.04–0.08) across all 3 DGPs; fixed-weight over-rejects by ~1.8–3.2× on smaller panels; placebo is also near-nominal; jackknife is slightly anti-conservative on smaller panels (matches Arkhangelsky §6.3's reported 98% / 93% coverage pattern).
- Closes a `SyntheticDiD.set_params` validation gap: the sklearn-style setter path bypassed constructor validation for the new enum. Extracts a shared `_validate_config` helper and makes `set_params` transactional so a validation failure rolls back touched attributes rather than leaving the instance partially mutated.

Methodology references (required if estimator / math changes)
- Arkhangelsky et al. (2021), Algorithm 2 step 2, underlies the `variance_method="bootstrap_refit"` variant. `Synthdid.jl::src/vcov.jl:96-103` is the only existing refit-bootstrap implementation; R `synthdid::vcov(method="bootstrap")` and Stata `sdid.ado:1033-1037` both use the fixed-weight shortcut. Full methodology surface + requirements checklist row in `docs/methodology/REGISTRY.md` §SyntheticDiD.
- Survey designs + `bootstrap_refit` raise `NotImplementedError` — Rao-Wu rescaled weights composed with FW re-estimation needs its own derivation (the paper is survey-unaware; R has no survey support). Documented in REGISTRY.md and tracked in TODO.md.
- FW non-convergence warnings are aggregated into a single summary `UserWarning` at end-of-loop if the rate exceeds 5% (the same threshold as retry-exhaustion), rather than emitting one per bootstrap draw; this preserves the warning signal without 200+ per-fit spam.
- No deviations from the paper's estimator math, SE formula (`sqrt((r-1)/r) × sd(ddof=1)` — unchanged), or p-value dispatch (analytical from the bootstrap SE — unchanged from PR #349, "Fix SyntheticDiD bootstrap p-value dispatch and SE formula").

Validation
- `tests/test_methodology_sdid.py::TestBootstrapRefitSE` — 8 real-fit tests (positive SE, diverges from fixed, tracks placebo on an exchangeable DGP, raises on pweight + full-design survey, analytical p-value dispatch, enum validation, summary renders replications)
- `tests/test_methodology_sdid.py::TestPValueSemantics::test_refit_p_value_matches_analytical` — mirror of the fixed-weight bootstrap regression guard
- `tests/test_methodology_sdid.py::TestGetSetParams` — 4 new set_params validation tests (accept bootstrap_refit, reject invalid enum, reject incoherent n_bootstrap, allow n_bootstrap=1 under jackknife) + 1 transactional-rollback test
- `tests/test_methodology_sdid.py::TestCoverageMCArtifact` — schema smoke test on the MC JSON (guarded with `pytest.skip` if absent per `feedback_golden_file_pytest_skip.md`)
- `tests/test_business_report.py::TestSyntheticDiDBootstrapRefitInferenceLabel` — cross-surface guard that BR emits `"bootstrap_refit variance"` on alpha-override, not the analytical fallback label
- PR #349's 1e-10 R-parity gate (`TestJackknifeSERParity::test_bootstrap_se_matches_r`) still passes — the fixed-weight path is byte-identical to baseline despite the `_bootstrap_se` signature expansion
- `benchmarks/data/sdid_coverage.json` (500 seeds × B=200 × 3 DGPs × 4 methods, generated on an M-series Mac + Rust backend in ~40 min). Headline calibration table rendered in `REGISTRY.md` §SyntheticDiD:

Security / privacy
Generated with Claude Code